{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "view-in-github"
   },
   "source": [
    "<a href=\"https://colab.research.google.com/github/NeuromatchAcademy/course-content/blob/master/tutorials/W2D5_ReinforcementLearning/W2D5_Tutorial4.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text"
   },
   "source": [
    "# Neuromatch Academy: Week 3, Day 4, Tutorial 4\n",
    "# From Reinforcement Learning to Planning\n",
    "\n",
    "__Content creators:__ Marcelo Mattar and Eric DeWitt with help from Byron Galbraith\n",
    "\n",
    "__Content reviewers:__ Matthew Krause and Michael Waskom"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text"
   },
   "source": [
    "---\n",
    "\n",
    "# Tutorial Objectives\n",
    "  \n",
    "In this tutorial you will implement one of the simplest model-based Reinforcement Learning algorithms, Dyna-Q. You will understand what a world model is, how it can improve the agent's policy, and the situations in which model-based algorithms are more advantagenous than their model-free counterparts.\n",
    "    \n",
    "* You will implement a model-based RL agent, Dyna-Q, that can solve a simple task;\n",
    "* You will investigate the effect of planning on the agent's behavior;\n",
    "* You will compare the behaviors of a model-based and model-free agent in light of an environmental change."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text"
   },
   "source": [
    "---\n",
    "# Setup\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "execution": {
     "iopub.execute_input": "2021-05-25T01:20:07.649223Z",
     "iopub.status.busy": "2021-05-25T01:20:07.648690Z",
     "iopub.status.idle": "2021-05-25T01:20:08.193340Z",
     "shell.execute_reply": "2021-05-25T01:20:08.193991Z"
    }
   },
   "outputs": [],
   "source": [
    "# Imports\n",
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "from scipy.signal import convolve as conv"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "cellView": "form",
    "colab": {},
    "colab_type": "code",
    "execution": {
     "iopub.execute_input": "2021-05-25T01:20:08.204119Z",
     "iopub.status.busy": "2021-05-25T01:20:08.203593Z",
     "iopub.status.idle": "2021-05-25T01:20:08.242398Z",
     "shell.execute_reply": "2021-05-25T01:20:08.241673Z"
    }
   },
   "outputs": [],
   "source": [
    "#@title Figure settings\n",
    "%config InlineBackend.figure_format = 'retina'\n",
    "plt.style.use(\"https://raw.githubusercontent.com/NeuromatchAcademy/course-content/master/nma.mplstyle\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "cellView": "form",
    "colab": {},
    "colab_type": "code",
    "execution": {
     "iopub.execute_input": "2021-05-25T01:20:08.256203Z",
     "iopub.status.busy": "2021-05-25T01:20:08.250578Z",
     "iopub.status.idle": "2021-05-25T01:20:08.285567Z",
     "shell.execute_reply": "2021-05-25T01:20:08.285053Z"
    }
   },
   "outputs": [],
   "source": [
    "#@title Helper functions\n",
    "def epsilon_greedy(q, epsilon):\n",
    "  \"\"\"Epsilon-greedy policy: selects the maximum value action with probabilty\n",
    "  (1-epsilon) and selects randomly with epsilon probability.\n",
    "\n",
    "  Args:\n",
    "    q (ndarray): an array of action values\n",
    "    epsilon (float): probability of selecting an action randomly\n",
    "\n",
    "  Returns:\n",
    "    int: the chosen action\n",
    "  \"\"\"\n",
    "  be_greedy = np.random.random() > epsilon\n",
    "  if be_greedy:\n",
    "    action = np.argmax(q)\n",
    "  else:\n",
    "    action = np.random.choice(len(q))\n",
    "\n",
    "  return action\n",
    "\n",
    "\n",
    "def q_learning(state, action, reward, next_state, value, params):\n",
    "  \"\"\"Q-learning: updates the value function and returns it.\n",
    "\n",
    "  Args:\n",
    "    state (int): the current state identifier\n",
    "    action (int): the action taken\n",
    "    reward (float): the reward received\n",
    "    next_state (int): the transitioned to state identifier\n",
    "    value (ndarray): current value function of shape (n_states, n_actions)\n",
    "    params (dict): a dictionary containing the default parameters\n",
    "\n",
    "  Returns:\n",
    "    ndarray: the updated value function of shape (n_states, n_actions)\n",
    "  \"\"\"\n",
    "  # value of previous state-action pair\n",
    "  prev_value = value[int(state), int(action)]\n",
    "\n",
    "  # maximum Q-value at current state\n",
    "  if next_state is None or np.isnan(next_state):\n",
    "      max_value = 0\n",
    "  else:\n",
    "      max_value = np.max(value[int(next_state)])\n",
    "\n",
    "  # reward prediction error\n",
    "  delta = reward + params['gamma'] * max_value - prev_value\n",
    "\n",
    "  # update value of previous state-action pair\n",
    "  value[int(state), int(action)] = prev_value + params['alpha'] * delta\n",
    "\n",
    "  return value\n",
    "\n",
    "\n",
    "def learn_environment(env, model_updater, planner, params, max_steps,\n",
    "                      n_episodes, shortcut_episode=None):\n",
    "  # Start with a uniform value function\n",
    "  value = np.ones((env.n_states, env.n_actions))\n",
    "\n",
    "  # Run learning\n",
    "  reward_sums = np.zeros(n_episodes)\n",
    "  episode_steps = np.zeros(n_episodes)\n",
    "\n",
    "  # Dyna-Q state\n",
    "  model = np.nan*np.zeros((env.n_states, env.n_actions, 2))\n",
    "\n",
    "  # Loop over episodes\n",
    "  for episode in range(n_episodes):\n",
    "    if shortcut_episode is not None and episode == shortcut_episode:\n",
    "      env.toggle_shortcut()\n",
    "      state = 64\n",
    "      action = 1\n",
    "      next_state, reward = env.get_outcome(state, action)\n",
    "      model[state, action] = reward, next_state\n",
    "      value = q_learning(state, action, reward, next_state, value, params)\n",
    "\n",
    "\n",
    "    state = env.init_state  # initialize state\n",
    "    reward_sum = 0\n",
    "\n",
    "    for t in range(max_steps):\n",
    "      # choose next action\n",
    "      action = epsilon_greedy(value[state], params['epsilon'])\n",
    "\n",
    "      # observe outcome of action on environment\n",
    "      next_state, reward = env.get_outcome(state, action)\n",
    "\n",
    "      # sum rewards obtained\n",
    "      reward_sum += reward\n",
    "\n",
    "      # update value function\n",
    "      value = q_learning(state, action, reward, next_state, value, params)\n",
    "\n",
    "      # update model\n",
    "      model = model_updater(model, state, action, reward, next_state)\n",
    "      # execute planner\n",
    "      value = planner(model, value, params)\n",
    "\n",
    "      if next_state is None:\n",
    "        break  # episode ends\n",
    "      state = next_state\n",
    "\n",
    "    reward_sums[episode] = reward_sum\n",
    "    episode_steps[episode] = t+1\n",
    "\n",
    "  return value, reward_sums, episode_steps\n",
    "\n",
    "\n",
    "class world(object):\n",
    "    def __init__(self):\n",
    "        return\n",
    "\n",
    "    def get_outcome(self):\n",
    "        print(\"Abstract method, not implemented\")\n",
    "        return\n",
    "\n",
    "    def get_all_outcomes(self):\n",
    "        outcomes = {}\n",
    "        for state in range(self.n_states):\n",
    "            for action in range(self.n_actions):\n",
    "                next_state, reward = self.get_outcome(state, action)\n",
    "                outcomes[state, action] = [(1, next_state, reward)]\n",
    "        return outcomes\n",
    "\n",
    "class QuentinsWorld(world):\n",
    "    \"\"\"\n",
    "    World: Quentin's world.\n",
    "    100 states (10-by-10 grid world).\n",
    "    The mapping from state to the grid is as follows:\n",
    "    90 ...       99\n",
    "    ...\n",
    "    40 ...       49\n",
    "    30 ...       39\n",
    "    20 21 22 ... 29\n",
    "    10 11 12 ... 19\n",
    "    0  1  2  ...  9\n",
    "    54 is the start state.\n",
    "    Actions 0, 1, 2, 3 correspond to right, up, left, down.\n",
    "    Moving anywhere from state 99 (goal state) will end the session.\n",
    "    Landing in red states incurs a reward of -1.\n",
    "    Landing in the goal state (99) gets a reward of 1.\n",
    "    Going towards the border when already at the border will stay in the same\n",
    "        place.\n",
    "    \"\"\"\n",
    "    def __init__(self):\n",
    "        self.name = \"QuentinsWorld\"\n",
    "        self.n_states = 100\n",
    "        self.n_actions = 4\n",
    "        self.dim_x = 10\n",
    "        self.dim_y = 10\n",
    "        self.init_state = 54\n",
    "        self.shortcut_state = 64\n",
    "\n",
    "    def toggle_shortcut(self):\n",
    "      if self.shortcut_state == 64:\n",
    "        self.shortcut_state = 2\n",
    "      else:\n",
    "        self.shortcut_state = 64\n",
    "\n",
    "    def get_outcome(self, state, action):\n",
    "        if state == 99:  # goal state\n",
    "            reward = 0\n",
    "            next_state = None\n",
    "            return next_state, reward\n",
    "        reward = 0  # default reward value\n",
    "        if action == 0:  # move right\n",
    "            next_state = state + 1\n",
    "            if state == 98:  # next state is goal state\n",
    "                reward = 1\n",
    "            elif state % 10 == 9:  # right border\n",
    "                next_state = state\n",
    "            elif state in [11, 21, 31, 41, 51, 61, 71,\n",
    "                           12, 72,\n",
    "                           73,\n",
    "                           14, 74,\n",
    "                           15, 25, 35, 45, 55, 65, 75]:  # next state is red\n",
    "                reward = -1\n",
    "        elif action == 1:  # move up\n",
    "            next_state = state + 10\n",
    "            if state == 89:  # next state is goal state\n",
    "                reward = 1\n",
    "            if state >= 90:  # top border\n",
    "                next_state = state\n",
    "            elif state in [2, 12, 22, 32, 42, 52, 62,\n",
    "                           3, 63,\n",
    "                           self.shortcut_state,\n",
    "                           5, 65,\n",
    "                           6, 16, 26, 36, 46, 56, 66]:  # next state is red\n",
    "                reward = -1\n",
    "        elif action == 2:  # move left\n",
    "            next_state = state - 1\n",
    "            if state % 10 == 0:  # left border\n",
    "                next_state = state\n",
    "            elif state in [17, 27, 37, 47, 57, 67, 77,\n",
    "                           16, 76,\n",
    "                           75,\n",
    "                           14, 74,\n",
    "                           13, 23, 33, 43, 53, 63, 73]:  # next state is red\n",
    "                reward = -1\n",
    "        elif action == 3:  # move down\n",
    "            next_state = state - 10\n",
    "            if state <= 9:  # bottom border\n",
    "                next_state = state\n",
    "            elif state in [22, 32, 42, 52, 62, 72, 82,\n",
    "                           23, 83,\n",
    "                           84,\n",
    "                           25, 85,\n",
    "                           26, 36, 46, 56, 66, 76, 86]:  # next state is red\n",
    "                reward = -1\n",
    "        else:\n",
    "            print(\"Action must be between 0 and 3.\")\n",
    "            next_state = None\n",
    "            reward = None\n",
    "        return int(next_state) if next_state is not None else None, reward\n",
    "\n",
    "\n",
    "# HELPER FUNCTIONS FOR PLOTTING\n",
    "\n",
    "def plot_state_action_values(env, value, ax=None):\n",
    "  \"\"\"\n",
    "  Generate plot showing value of each action at each state.\n",
    "  \"\"\"\n",
    "  if ax is None:\n",
    "    fig, ax = plt.subplots()\n",
    "\n",
    "  for a in range(env.n_actions):\n",
    "    ax.plot(range(env.n_states), value[:, a], marker='o', linestyle='--')\n",
    "  ax.set(xlabel='States', ylabel='Values')\n",
    "  ax.legend(['R','U','L','D'], loc='lower right')\n",
    "\n",
    "\n",
    "def plot_quiver_max_action(env, value, ax=None):\n",
    "  \"\"\"\n",
    "  Generate plot showing action of maximum value or maximum probability at\n",
    "    each state (not for n-armed bandit or cheese_world).\n",
    "  \"\"\"\n",
    "  if ax is None:\n",
    "    fig, ax = plt.subplots()\n",
    "\n",
    "  X = np.tile(np.arange(env.dim_x), [env.dim_y,1]) + 0.5\n",
    "  Y = np.tile(np.arange(env.dim_y)[::-1][:,np.newaxis], [1,env.dim_x]) + 0.5\n",
    "  which_max = np.reshape(value.argmax(axis=1), (env.dim_y,env.dim_x))\n",
    "  which_max = which_max[::-1,:]\n",
    "  U = np.zeros(X.shape)\n",
    "  V = np.zeros(X.shape)\n",
    "  U[which_max == 0] = 1\n",
    "  V[which_max == 1] = 1\n",
    "  U[which_max == 2] = -1\n",
    "  V[which_max == 3] = -1\n",
    "\n",
    "  ax.quiver(X, Y, U, V)\n",
    "  ax.set(\n",
    "      title='Maximum value/probability actions',\n",
    "      xlim=[-0.5, env.dim_x+0.5],\n",
    "      ylim=[-0.5, env.dim_y+0.5],\n",
    "  )\n",
    "  ax.set_xticks(np.linspace(0.5, env.dim_x-0.5, num=env.dim_x))\n",
    "  ax.set_xticklabels([\"%d\" % x for x in np.arange(env.dim_x)])\n",
    "  ax.set_xticks(np.arange(env.dim_x+1), minor=True)\n",
    "  ax.set_yticks(np.linspace(0.5, env.dim_y-0.5, num=env.dim_y))\n",
    "  ax.set_yticklabels([\"%d\" % y for y in np.arange(0, env.dim_y*env.dim_x, env.dim_x)])\n",
    "  ax.set_yticks(np.arange(env.dim_y+1), minor=True)\n",
    "  ax.grid(which='minor',linestyle='-')\n",
    "\n",
    "\n",
    "def plot_heatmap_max_val(env, value, ax=None):\n",
    "  \"\"\"\n",
    "  Generate heatmap showing maximum value at each state\n",
    "  \"\"\"\n",
    "  if ax is None:\n",
    "    fig, ax = plt.subplots()\n",
    "\n",
    "  if value.ndim == 1:\n",
    "      value_max = np.reshape(value, (env.dim_y,env.dim_x))\n",
    "  else:\n",
    "      value_max = np.reshape(value.max(axis=1), (env.dim_y,env.dim_x))\n",
    "  value_max = value_max[::-1,:]\n",
    "\n",
    "  im = ax.imshow(value_max, aspect='auto', interpolation='none', cmap='afmhot')\n",
    "  ax.set(title='Maximum value per state')\n",
    "  ax.set_xticks(np.linspace(0, env.dim_x-1, num=env.dim_x))\n",
    "  ax.set_xticklabels([\"%d\" % x for x in np.arange(env.dim_x)])\n",
    "  ax.set_yticks(np.linspace(0, env.dim_y-1, num=env.dim_y))\n",
    "  if env.name != 'windy_cliff_grid':\n",
    "      ax.set_yticklabels(\n",
    "          [\"%d\" % y for y in np.arange(\n",
    "              0, env.dim_y*env.dim_x, env.dim_x)][::-1])\n",
    "  return im\n",
    "\n",
    "\n",
    "def plot_rewards(n_episodes, rewards, average_range=10, ax=None):\n",
    "  \"\"\"\n",
    "  Generate plot showing total reward accumulated in each episode.\n",
    "  \"\"\"\n",
    "  if ax is None:\n",
    "    fig, ax = plt.subplots()\n",
    "\n",
    "  smoothed_rewards = (conv(rewards, np.ones(average_range), mode='same')\n",
    "                      / average_range)\n",
    "\n",
    "  ax.plot(range(0, n_episodes, average_range),\n",
    "          smoothed_rewards[0:n_episodes:average_range],\n",
    "          marker='o', linestyle='--')\n",
    "  ax.set(xlabel='Episodes', ylabel='Total reward')\n",
    "\n",
    "\n",
    "def plot_performance(env, value, reward_sums):\n",
    "  fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(16, 12))\n",
    "  plot_state_action_values(env, value, ax=axes[0,0])\n",
    "  plot_quiver_max_action(env, value, ax=axes[0,1])\n",
    "  plot_rewards(n_episodes, reward_sums, ax=axes[1,0])\n",
    "  im = plot_heatmap_max_val(env, value, ax=axes[1,1])\n",
    "  fig.colorbar(im)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text"
   },
   "source": [
    "---\n",
    "\n",
    "# Section 1: Model-based RL"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "cellView": "form",
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 519
    },
    "colab_type": "code",
    "execution": {
     "iopub.execute_input": "2021-05-25T01:20:08.293479Z",
     "iopub.status.busy": "2021-05-25T01:20:08.292958Z",
     "iopub.status.idle": "2021-05-25T01:20:08.326374Z",
     "shell.execute_reply": "2021-05-25T01:20:08.326946Z"
    },
    "outputId": "c3fa1dec-7e0a-4225-f577-c781fced323c"
   },
   "outputs": [],
   "source": [
    "#@title Video 1: Model-based RL\n",
    "# Insert the ID of the corresponding youtube video\n",
    "from IPython.display import YouTubeVideo\n",
    "video = YouTubeVideo(id=\"zT_legTotF0\", width=854, height=480, fs=1)\n",
    "print(\"Video available at https://youtu.be/\" + video.id)\n",
    "video"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text"
   },
   "source": [
    "The algorithms introduced in the previous tutorials are all *model-free*, as they do not require a model to use or control behavior. In this section, we will study a different class of algorithms called model-based. As we will see next, in contrast to model-free RL, model-based methods use a model to build a policy.\n",
    "\n",
    "But what is a model? A model (sometimes called a world model or internal model) is a representation of how the world will respond to the agent's actions. You can think of it as a representation of how the world *works*. With such a representation, the agent can simulate new experiences and learn from these simulations. This is advantageous for two reasons. First, acting in the real world can be costly and sometimes even dangerous: remember Cliff World from Tutorial 3? Learning from simulated experience can avoid some of these costs or risks. Second, simulations make fuller use of one's limited experience. To see why, imagine an agent interacting with the real world. The information acquired with each individual action can only be assimilated at the moment of the interaction. In contrast, the experiences simulated from a model can be simulated multiple times -- and whenever desired -- allowing for the information to be more fully assimilated."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text"
   },
   "source": [
    "## Section 1.1 Quentin's World Environment\n",
    "\n",
    "In this tutorial, our RL agent will act in the Quentin's world, a 10x10 grid world. \n",
    "\n",
    "<img alt=\"QuentinsWorld\" width=\"560\" height=\"560\" src=\"https://github.com/NeuromatchAcademy/course-content/blob/master/tutorials/W2D5_ReinforcementLearning/static/W2D5_Tutorial4_QuentinsWorld.png?raw=true\">\n",
    "\n",
    "In this environment, there are 100 states and 4 possible actions: right, up, left, and down. The goal of the agent is to move, via a series of steps, from the start (green) location to the goal (yellow) region, while avoiding the red walls. More specifically:\n",
    "* The agent starts in the green state,\n",
    "* Moving into one of the red states incurs a reward of -1,\n",
    "* Moving into the world borders stays in the same place,\n",
    "* Moving into the goal state (yellow square in the upper right corner) gives you a reward of 1, and\n",
    "* Moving anywhere from the goal state ends the episode.\n",
    "\n",
    "Now that we have our environment and task defined, how can we solve this using a model-based RL agent?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text"
   },
   "source": [
    "---\n",
    "# Section 2: Dyna-Q\n",
    "\n",
    "In this section, we will implement Dyna-Q, one of the simplest model-based reinforcement learning algorithms. A Dyna-Q agent combines acting, learning, and planning. The first two components -- acting and learning -- are just like what we have studied previously. Q-learning, for example, learns by acting in the world, and therefore combines acting and learning. But a Dyna-Q agent also implements planning, or simulating experiences from a model--and learns from them. \n",
    "\n",
    "In theory, one can think of a Dyna-Q agent as implementing acting, learning, and planning simultaneously, at all times. But, in practice, one needs to specify the algorithm as a sequence of steps. The most common way in which the Dyna-Q agent is implemented is by adding a planning routine to a Q-learning agent: after the agent acts in the real world and learns from the observed experience, the agent is allowed a series of $k$ *planning steps*. At each one of those $k$ planning steps, the model generates a simulated experience by randomly sampling from the history of all previously experienced state-action pairs. The agent then learns from this simulated experience, again using the same Q-learning rule that you implemented for learning from real experience. This simulated experience is simply a one-step transition, i.e., a state, an action, and the resulting state and reward. So, in practice, a Dyna-Q agent learns (via Q-learning) from one step of **real** experience during acting, and then from k steps of **simulated** experience during planning.\n",
    "\n",
    "There's one final detail about this algorithm: where does the simulated experiences come from or, in other words, what is the \"model\"? In Dyna-Q, as the agent interacts with the environment, the agent also learns the model. For simplicity, Dyna-Q implements model-learning in an almost trivial way, as simply caching the results of each transition. Thus, after each one-step transition in the environment, the agent saves the results of this transition in a big matrix, and consults that matrix during each of the planning steps. Obviously, this model-learning strategy only makes sense if the world is deterministic (so that each state-action pair always leads to the same state and reward), and this is the setting of the exercise below. However, even this simple setting can already highlight one of Dyna-Q major strengths: the fact that the planning is done at the same time as the agent interacts with the environment, which means that new information gained from the interaction may change the model and thereby interact with planning in potentially interesting ways.\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text"
   },
   "source": [
    "Since you already implemented Q-learning in the previous tutorial, we will focus here on the extensions new to Dyna-Q: the model update step and the planning step. For reference, here's the Dyna-Q algorithm that you will help implement:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text"
   },
   "source": [
    "---\n",
    "**TABULAR DYNA-Q**\n",
    "\n",
    "Initialize $Q(s,a)$ and $Model(s,a)$ for all $s \\in S$ and $a \\in A$.\n",
    "\n",
    "Loop forever:\n",
    "\n",
    "> (a) $S$ &larr; current (nonterminal) state <br>\n",
    "> (b) $A$ &larr; $\\epsilon$-greedy$(S,Q)$ <br>\n",
    "> (c) Take action $A$; observe resultant reward, $R$, and state, $S'$ <br>\n",
    "> (d) $Q(S,A)$ &larr; $Q(S,A) + \\alpha \\left[R + \\gamma \\max_{a} Q(S',a) - Q(S,A)\\right]$ <br>\n",
    "> (e) $Model(S,A)$ &larr; $R,S'$ (assuming deterministic environment) <br>\n",
    "> (f) Loop repeat $k$ times: <br>\n",
    ">> $S$ &larr; random previously observed state <br>\n",
    ">> $A$ &larr; random action previously taken in $S$ <br>\n",
    ">> $R,S'$ &larr; $Model(S,A)$ <br>\n",
    ">> $Q(S,A)$ &larr; $Q(S,A) + \\alpha \\left[R + \\gamma \\max_{a} Q(S',a) - Q(S,A)\\right]$ <br>\n",
    "\n",
    "\n",
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text"
   },
   "source": [
    "## Exercise 1: Dyna-Q Model Update\n",
    "\n",
    "In this exercise you will implement the model update portion of the Dyna-Q algorithm. More specifically, after each action that the agent executes in the world, we need to update our model to remember what reward and next state we last experienced for the given state-action pair."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "execution": {
     "iopub.execute_input": "2021-05-25T01:20:08.332617Z",
     "iopub.status.busy": "2021-05-25T01:20:08.331303Z",
     "iopub.status.idle": "2021-05-25T01:20:08.333375Z",
     "shell.execute_reply": "2021-05-25T01:20:08.333801Z"
    }
   },
   "outputs": [],
   "source": [
    "def dyna_q_model_update(model, state, action, reward, next_state):\n",
    "  \"\"\" Dyna-Q model update\n",
    "\n",
    "  Args:\n",
    "    model (ndarray): An array of shape (n_states, n_actions, 2) that represents\n",
    "                     the model of the world i.e. what reward and next state do\n",
    "                     we expect from taking an action in a state.\n",
    "    state (int): the current state identifier\n",
    "    action (int): the action taken\n",
    "    reward (float): the reward received\n",
    "    next_state (int): the transitioned to state identifier\n",
    "\n",
    "  Returns:\n",
    "    ndarray: the updated model\n",
    "  \"\"\"\n",
    "  ###############################################################\n",
    "  ## TODO for students: implement the model update step of Dyna-Q\n",
    "  # Fill out function and remove\n",
    "  raise NotImplementedError(\"Student exercise: implement the model update step of Dyna-Q\")\n",
    "  ###############################################################\n",
    "  # Update our model with the observed reward and next state\n",
    "  model[...] = ...\n",
    "\n",
    "  return model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "execution": {
     "iopub.execute_input": "2021-05-25T01:20:08.338901Z",
     "iopub.status.busy": "2021-05-25T01:20:08.337756Z",
     "iopub.status.idle": "2021-05-25T01:20:08.339481Z",
     "shell.execute_reply": "2021-05-25T01:20:08.339997Z"
    }
   },
   "outputs": [],
   "source": [
    "# to_remove solution\n",
    "def dyna_q_model_update(model, state, action, reward, next_state):\n",
    "  \"\"\" Dyna-Q model update\n",
    "\n",
    "  Args:\n",
    "    model (ndarray): An array of shape (n_states, n_actions, 2) that represents\n",
    "                     the model of the world i.e. what reward and next state do\n",
    "                     we expect from taking an action in a state.\n",
    "    state (int): the current state identifier\n",
    "    action (int): the action taken\n",
    "    reward (float): the reward received\n",
    "    next_state (int): the transitioned to state identifier\n",
    "\n",
    "  Returns:\n",
    "    ndarray: the updated model\n",
    "  \"\"\"\n",
    "  # Update our model with the observed reward and next state\n",
    "  model[state, action] = reward, next_state\n",
    "\n",
    "  return model"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text"
   },
   "source": [
    "Now that we have a way to update our model, we can use it in the planning phase of Dyna-Q to simulate past experiences."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text"
   },
   "source": [
    "## Exercise 2: Dyna-Q Planning\n",
    "\n",
    "In this exercise you will implement the other key part of Dyna-Q: planning. We will sample a random state-action pair from those we've experienced, use our model to simulate the experience of taking that action in that state, and update our value function using Q-learning with these simulated state, action, reward, and next state outcomes. Furthermore, we want to run this planning step $k$ times, which can be obtained from `params['k']`.\n",
    "\n",
    "For this exercise, you may use the `q_learning` function to handle the Q-learning value function update. Recall that the method signature is `q_learning(state, action, reward, next_state, value, params)` and it returns the updated `value` table."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "execution": {
     "iopub.execute_input": "2021-05-25T01:20:08.346344Z",
     "iopub.status.busy": "2021-05-25T01:20:08.344994Z",
     "iopub.status.idle": "2021-05-25T01:20:08.347103Z",
     "shell.execute_reply": "2021-05-25T01:20:08.347708Z"
    }
   },
   "outputs": [],
   "source": [
    "def dyna_q_planning(model, value, params):\n",
    "  \"\"\" Dyna-Q planning\n",
    "\n",
    "  Args:\n",
    "    model (ndarray): An array of shape (n_states, n_actions, 2) that represents\n",
    "                     the model of the world i.e. what reward and next state do\n",
    "                     we expect from taking an action in a state.\n",
    "    value (ndarray): current value function of shape (n_states, n_actions)\n",
    "    params (dict): a dictionary containing learning parameters\n",
    "\n",
    "  Returns:\n",
    "    ndarray: the updated value function of shape (n_states, n_actions)\n",
    "  \"\"\"\n",
    "  ############################################################\n",
    "  ## TODO for students: implement the planning step of Dyna-Q\n",
    "  # Fill out function and remove\n",
    "  raise NotImplementedError(\"Student exercise: implement the planning step of Dyna-Q\")\n",
    "  #############################################################\n",
    "  # Perform k additional updates at random (planning)\n",
    "  for _ in range(...):\n",
    "    # Find state-action combinations for which we've experienced a reward i.e.\n",
    "    # the reward value is not NaN. The outcome of this expression is an Nx2\n",
    "    # matrix, where each row is a state and action value, respectively.\n",
    "    candidates = np.array(np.where(~np.isnan(model[:,:,0]))).T\n",
    "\n",
    "    # Write an expression for selecting a random row index from our candidates\n",
    "    idx = ...\n",
    "\n",
    "    # Obtain the randomly selected state and action values from the candidates\n",
    "    state, action = ...\n",
    "\n",
    "    # Obtain the expected reward and next state from the model\n",
    "    reward, next_state = ...\n",
    "\n",
    "    # Update the value function using Q-learning\n",
    "    value = ...\n",
    "\n",
    "  return value"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "execution": {
     "iopub.execute_input": "2021-05-25T01:20:08.354258Z",
     "iopub.status.busy": "2021-05-25T01:20:08.352832Z",
     "iopub.status.idle": "2021-05-25T01:20:08.355003Z",
     "shell.execute_reply": "2021-05-25T01:20:08.355648Z"
    }
   },
   "outputs": [],
   "source": [
    "# to_remove solution\n",
    "def dyna_q_planning(model, value, params):\n",
    "  \"\"\" Dyna-Q planning\n",
    "\n",
    "  Args:\n",
    "    model (ndarray): An array of shape (n_states, n_actions, 2) that represents\n",
    "                     the model of the world i.e. what reward and next state do\n",
    "                     we expect from taking an action in a state.\n",
    "    value (ndarray): current value function of shape (n_states, n_actions)\n",
    "    params (dict): a dictionary containing learning parameters\n",
    "\n",
    "  Returns:\n",
    "    ndarray: the updated value function of shape (n_states, n_actions)\n",
    "  \"\"\"\n",
    "  # Perform k additional updates at random (planning)\n",
    "  for _ in range(params['k']):\n",
    "    # Find state-action combinations for which we've experienced a reward i.e.\n",
    "    # the reward value is not NaN. The outcome of this expression is an Nx2\n",
    "    # matrix, where each row is a state and action value, respectively.\n",
    "    candidates = np.array(np.where(~np.isnan(model[:,:,0]))).T\n",
    "\n",
    "    # Write an expression for selecting a random row index from our candidates\n",
    "    idx = np.random.choice(len(candidates))\n",
    "\n",
    "    # Obtain the randomly selected state and action values from the candidates\n",
    "    state, action = candidates[idx]\n",
    "\n",
    "    # Obtain the expected reward and next state from the model\n",
    "    reward, next_state = model[state, action]\n",
    "\n",
    "    # Update the value function using Q-learning\n",
    "    value = q_learning(state, action, reward, next_state, value, params)\n",
    "\n",
    "  return value"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text"
   },
   "source": [
    "With a way to update our model and a means to use it in planning, it is time to see it in action. The following code sets up the our agent parameters and learning environment, then passes your model update and planning methods to the agent to try and solve Quentin's World. Notice that we set the number of planning steps $k=10$."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 863
    },
    "colab_type": "code",
    "execution": {
     "iopub.execute_input": "2021-05-25T01:20:08.446679Z",
     "iopub.status.busy": "2021-05-25T01:20:08.431146Z",
     "iopub.status.idle": "2021-05-25T01:20:14.710665Z",
     "shell.execute_reply": "2021-05-25T01:20:14.711130Z"
    },
    "outputId": "bb1aafe5-7d87-43d0-bb9b-098b55cd8b43"
   },
   "outputs": [],
   "source": [
    "# set for reproducibility, comment out / change seed value for different results\n",
    "np.random.seed(1)\n",
    "\n",
    "# parameters needed by our policy and learning rule\n",
    "params = {\n",
    "  'epsilon': 0.05,  # epsilon-greedy policy\n",
    "  'alpha': 0.5,  # learning rate\n",
    "  'gamma': 0.8,  # temporal discount factor\n",
    "  'k': 10,  # number of Dyna-Q planning steps\n",
    "}\n",
    "\n",
    "# episodes/trials\n",
    "n_episodes = 500\n",
    "max_steps = 1000\n",
    "\n",
    "# environment initialization\n",
    "env = QuentinsWorld()\n",
    "\n",
    "# solve Quentin's World using Dyna-Q\n",
    "results = learn_environment(env, dyna_q_model_update, dyna_q_planning,\n",
    "                            params, max_steps, n_episodes)\n",
    "value, reward_sums, episode_steps = results\n",
    "\n",
    "plot_performance(env, value, reward_sums)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text"
   },
   "source": [
    "Upon completion, we should see that our Dyna-Q agent is able to solve the task quite quickly, achieving a consistent positive reward after only a limited number of episodes (bottom left)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text"
   },
   "source": [
    "---\n",
    "# Section 3: How much to plan?\n",
    "\n",
    "Now that you implemented a Dyna-Q agent with $k=10$, we will try to understand the effect of planning on performance. How does changing the value of $k$ impact our agent's ability to learn?\n",
    "\n",
    "The following code is similar to what we just ran, only this time we run several experiments over several different values of $k$ to see how their average performance compares. In particular, we will choose $k \\in \\{0, 1, 10, 100\\}$. Pay special attention to the case where $k = 0$ which corresponds to no planning. This is, in effect, just regular Q-learning.\n",
    "\n",
    "The following code will take a bit of time to complete. To speed things up, try lowering the number of experiments or the number of $k$ values to compare."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 430
    },
    "colab_type": "code",
    "execution": {
     "iopub.execute_input": "2021-05-25T01:20:14.910089Z",
     "iopub.status.busy": "2021-05-25T01:20:14.866520Z",
     "iopub.status.idle": "2021-05-25T01:22:21.986708Z",
     "shell.execute_reply": "2021-05-25T01:22:21.985880Z"
    },
    "outputId": "9cd8206c-6216-41e4-f80f-dd601e442d96"
   },
   "outputs": [],
   "source": [
    "# set for reproducibility, comment out / change seed value for different results\n",
    "np.random.seed(1)\n",
    "\n",
    "# parameters needed by our policy and learning rule\n",
    "params = {\n",
    "  'epsilon': 0.05,  # epsilon-greedy policy\n",
    "  'alpha': 0.5,  # learning rate\n",
    "  'gamma': 0.8,  # temporal discount factor\n",
    "}\n",
    "\n",
    "# episodes/trials\n",
    "n_experiments = 10\n",
    "n_episodes = 100\n",
    "max_steps = 1000\n",
    "\n",
    "# number of planning steps\n",
    "planning_steps = np.array([0, 1, 10, 100])\n",
    "\n",
    "# environment initialization\n",
    "env = QuentinsWorld()\n",
    "\n",
    "steps_per_episode = np.zeros((len(planning_steps), n_experiments, n_episodes))\n",
    "\n",
    "for i, k in enumerate(planning_steps):\n",
    "  params['k'] = k\n",
    "  for experiment in range(n_experiments):\n",
    "    results = learn_environment(env, dyna_q_model_update, dyna_q_planning,\n",
    "                                params, max_steps, n_episodes)\n",
    "    steps_per_episode[i, experiment] = results[2]\n",
    "\n",
    "# Average across experiments\n",
    "steps_per_episode = np.mean(steps_per_episode, axis=1)\n",
    "\n",
    "# Plot results\n",
    "fig, ax = plt.subplots()\n",
    "ax.plot(steps_per_episode.T)\n",
    "ax.set(xlabel='Episodes', ylabel='Steps per episode',\n",
    "       xlim=[20, None], ylim=[0, 160])\n",
    "ax.legend(planning_steps, loc='upper right', title=\"Planning steps\");"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text"
   },
   "source": [
    "After an initial warm-up phase of the first 20 episodes, we should see that the number of planning steps has a noticable impact on our agent's ability to rapidly solve the environment. We should also notice that after a certain value of $k$ our relative utility goes down, so it's important to balance a large enough value of $k$ that helps us learn quickly without wasting too much time in planning."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text"
   },
   "source": [
    "---\n",
    "# Section 4: When the world changes...\n",
    "\n",
    "In addition to speeding up learning about a new environment, planning can also help the agent to quickly incorporate new information about the environment into its policy. Thus, if the environment changes (e.g. the rules governing the transitions between states, or the rewards associated with each state/action), the agent doesn't need to experience that change *repeatedly* (as would be required in a Q-learning agent) in real experience. Instead, planning allows that change to be incorporated quickly into the agent's policy, without the need to experience the change more than once.\n",
    "\n",
    "In this final section, we will again have our agents attempt to solve Quentin's World. However, after 200 episodes, a shortcut will appear in the environment.  We will test how a model-free agent using Q-learning and a Dyna-Q agent adapt to this change in the environment.\n",
    "\n",
    "<img alt=\"QuentinsWorldShortcut\" width=\"560\" height=\"560\" src=\"https://github.com/NeuromatchAcademy/course-content/blob/master/tutorials/W2D5_ReinforcementLearning/static/W2D5_Tutorial4_QuentinsWorldShortcut.png?raw=true\">\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text"
   },
   "source": [
    "The following code again looks similar to what we've run previously. Just as above we will have multiple values for $k$, with $k=0$ representing our Q-learning agent and $k=10$ for our Dyna-Q agent with 10 planning steps. The main difference is we now add in an indicator as to when the shortcut appears. In particular, we will run the agents for 400 episodes, with the shortcut appearing in the middle after episode #200.\n",
    "\n",
    "When this shortcut appears we will also let each agent experience this change once i.e. we will evaluate the act of moving upwards when in the state that is below the now-open shortcut. After this single demonstration, the agents will continue on interacting in the environment.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 430
    },
    "colab_type": "code",
    "execution": {
     "iopub.execute_input": "2021-05-25T01:22:22.069670Z",
     "iopub.status.busy": "2021-05-25T01:22:22.030193Z",
     "iopub.status.idle": "2021-05-25T01:22:25.850169Z",
     "shell.execute_reply": "2021-05-25T01:22:25.849550Z"
    },
    "outputId": "f2160956-74b1-442e-a9de-9312d3b31329"
   },
   "outputs": [],
   "source": [
    "# set for reproducibility, comment out / change seed value for different results\n",
    "np.random.seed(1)\n",
    "\n",
    "# parameters needed by our policy and learning rule\n",
    "params = {\n",
    "  'epsilon': 0.05,  # epsilon-greedy policy\n",
    "  'alpha': 0.5,  # learning rate\n",
    "  'gamma': 0.8,  # temporal discount factor\n",
    "}\n",
    "\n",
    "# episodes/trials\n",
    "n_episodes = 400\n",
    "max_steps = 1000\n",
    "shortcut_episode = 200  # when we introduce the shortcut\n",
    "\n",
    "# number of planning steps\n",
    "planning_steps = np.array([0, 10]) # Q-learning, Dyna-Q (k=10)\n",
    "\n",
    "# environment initialization\n",
    "steps_per_episode = np.zeros((len(planning_steps), n_episodes))\n",
    "\n",
    "# Solve Quentin's World using Q-learning and Dyna-Q\n",
    "for i, k in enumerate(planning_steps):\n",
    "  env = QuentinsWorld()\n",
    "  params['k'] = k\n",
    "  results = learn_environment(env, dyna_q_model_update, dyna_q_planning,\n",
    "                              params, max_steps, n_episodes,\n",
    "                              shortcut_episode=shortcut_episode)\n",
    "  steps_per_episode[i] = results[2]\n",
    "\n",
    "\n",
    "# Plot results\n",
    "fig, ax = plt.subplots()\n",
    "ax.plot(steps_per_episode.T)\n",
    "ax.set(xlabel='Episode', ylabel='Steps per Episode',\n",
    "       xlim=[20,None], ylim=[0, 160])\n",
    "ax.axvline(shortcut_episode, linestyle=\"--\", color='gray', label=\"Shortcut appears\")\n",
    "ax.legend(('Q-learning', 'Dyna-Q', 'Shortcut appears'),\n",
    "          loc='upper right');"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text"
   },
   "source": [
    "If all went well, we should see the Dyna-Q agent having already achieved near optimal performance before the appearance of the shortcut and then immediately incorporating this new information to further improve. In this case, the Q-learning agent takes much longer to fully incorporate the new shortcut."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text"
   },
   "source": [
    "---\n",
    "# Summary\n",
    "\n",
    "In this notebook, you have learned about model-based reinforcement learning and implemented one of the simplest architectures of this type, Dyna-Q. Dyna-Q is very much like Q-learning, but instead of learning only from real experience, you also learn from **simulated** experience. This small difference, however, can have huge benefits! Planning *frees* the agent from the limitation of its own environment, and this in turn allows the agent to speed-up learning -- for instance, effectively incorporating environmental changes into one's policy.\n",
    "\n",
    "Not surprisingly, model-based RL is an active area of research in machine learning. Some of the exciting topics in the frontier of the field involve (i) learning and representing a complex world model (i.e., beyond the tabular and deterministic case above), and (ii) what to simulate -- also known as search control -- (i.e., beyond the random selection of experiences implemented above).\n",
    "\n",
    "The framework above has also been used in neuroscience to explain various phenomena such as planning, memory sampling, memory consolidation, and even dreaming!"
   ]
  }
 ],
 "metadata": {
  "colab": {
   "collapsed_sections": [],
   "include_colab_link": true,
   "name": "W3D4_Tutorial4",
   "provenance": [],
   "toc_visible": true
  },
  "kernel": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
